Good, bad, ugly loops
So far, we have created a function that extracts “our” features from a single text. Now, it is time to go over each row in the dataset and apply the function to the plot summaries. The first option that comes to mind is a classic for loop. Caution! This could get really ugly. TL;DR: TRY TO AVOID “FOR LOOPS” in R unless you are familiar with primitive functions or your dataset is small enough. I will briefly explain why in the next paragraph, but there is no heartbreak in skipping it.
According to Hadley Wickham’s Advanced R, “For loops in R have a reputation for being slow. Oftentimes, that slowness is due to creating a copy instead of modifying in place”. As an example, the function below runs a for loop over a dataframe. Because a dataframe is an ordinary R object rather than a primitive type implemented in C, every assignment into it inside the loop triggers R’s copy-on-modify semantics: a new COPY of sample is created before the keyword and word_count columns are updated. (Silver lining: since R 3.1.0, the copy is shallow rather than deep.) Imagine you would like to add a new piece of furniture to your house, so you build the same house from scratch in a new place and put the furniture in there!! That makes the function really slow and ugly.
for_on_dataframe <- function(sample) {
  for (i in 1:nrow(sample)) {
    # each of these assignments can copy the whole dataframe
    sample[i, c('keyword', 'word_count')] <- extract_text_features(sample[i, 2])
  }
  return(sample)
}
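To see the copying in action, here is a minimal sketch (my own illustration, not part of the original tutorial) that uses tracemem() to report every time R duplicates the dataframe inside a loop:

```r
# tracemem() prints a message each time R copies the tracked object,
# exposing the hidden copies behind in-loop dataframe assignment.
df <- data.frame(x = runif(3), y = runif(3))
tracemem(df)                      # start tracking copies of df
for (i in 1:3) {
  df[i, "y"] <- df[i, "x"] * 2    # each replacement can trigger a copy
}
untracemem(df)                    # stop tracking
```

Each “tracemem[...]” line printed inside the loop is a copy you paid for.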
As discussed above, using for loops on a “data.frame” can get really ugly as your data size increases. However, we might be able to improve performance by changing the data.frame to a list object or by using the apply family of functions. The apply functions in R do not generally provide improved performance, although some resources state that lapply can be a little faster because it does more of its work in C code than in R. In this tutorial we could not test lapply alone, as do.call(rbind, ...) was required on top of it to get a dataframe back. The chunk below includes two functions from the apply family, lapply and mapply (a multivariate form of sapply), plus a for loop on a list object:
lapply_script <- function(sample) {
  sample[, c('keyword', 'word_count')] <- do.call(rbind, lapply(sample[[2]], extract_text_features))
  return(sample)
}

mapply_script <- function(sample) {
  # mapply() simplifies the results to a 2 x n matrix, so transpose before assigning
  sample[, c('keyword', 'word_count')] <- t(mapply(extract_text_features, sample[, 2]))
  return(sample)
}

for_on_list <- function(sample) {
  sample <- as.list(sample)
  for (i in seq_along(sample[[1]])) {
    temp <- extract_text_features(sample[[2]][i])
    sample[['keyword']][i] <- temp[1]
    sample[['word_count']][i] <- temp[2]
  }
  return(sample)  # without this, the function would return the for loop's invisible NULL
}
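The chunks above assume extract_text_features() is already defined. If you want to run them in isolation, a hypothetical stand-in (my own sketch; the real function was built earlier in the tutorial) that returns a keyword and a word count might look like:

```r
# Toy stand-in for extract_text_features(): returns the most frequent
# word and the word count of one plot summary as a length-2 vector.
extract_text_features <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  c(keyword = names(sort(table(words), decreasing = TRUE))[1],
    word_count = length(words))
}
```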
Obviously, the extract_text_features tasks can be done independently on each plot summary, as we do not carry any results from one iteration into the next. So, we might be able to speed up the process by parallelizing the tasks on a machine with multiple cores. If you would like to dig deeper into parallelization, please check out this awesome page. The code chunk below uses the doParallel library to implement parallel code and provide a parallel backend for the foreach package. In spite of its great advantages, parallelizing R processes has some challenges that we need to keep in mind. Well, “There ain’t no such thing as a free lunch”. Please feel free to skip the next paragraph if you are in a rush.
I have personally dealt with two main caveats in using parallel processes:
- Mapping the task among multiple cores and reducing the results back usually takes time and work. We would only get a substantial speed-up if the task takes long enough for the overhead to be worth it. If you are interested in estimating the overhead time, you should look into the difference elapsed - user - system for the current session in the 5-element result of the proc.time function. The print format of system.time only includes the total execution time of all sessions on the different cores and might not tell the detailed story.
- Accessing global variables and dealing with global state is different than in single-threaded execution. Since each processor gets a copy of the input vectors, any modification of a global object inside a worker changes that copy instead of the original object. To overcome data-handling issues, we should set up foreach in a way that it returns the value from the parallelized task:
extracts <- foreach::foreach (i = 1:nrow(sample), .combine = rbind) %dopar% extract_text_features(sample[i, 2])
library(foreach)  # makes the %dopar% operator available
num_cores <- parallel::detectCores()

parallel_loop_script <- function(sample) {
  doParallel::registerDoParallel(num_cores)
  extracts <- foreach(i = 1:nrow(sample), .combine = rbind) %dopar%
    extract_text_features(sample[i, 2])
  return(as.data.frame(extracts))
}
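The overhead caveat above can also be estimated in practice. The sketch below (a toy illustration of my own; the tiny sqrt task is deliberately too small to benefit from parallelism) compares the wall-clock time with the master session’s own CPU time around a foreach call; the gap approximates the scheduling and data-transfer overhead:

```r
library(foreach)
doParallel::registerDoParallel(2)

# A deliberately tiny task: parallel overhead should dominate here.
t <- system.time(
  res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)
)

# elapsed = wall-clock time; user.self/sys.self = CPU time of the
# master session only, so the difference approximates the overhead.
overhead <- t[["elapsed"]] - (t[["user.self"]] + t[["sys.self"]])
doParallel::stopImplicitCluster()  # release the workers when done
```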
Parallel Effects
It is time to assess the performance of each loop function. We categorized the loops into The Good, the Bad, and the Ugly. Although the logic matters more than the categorization, a function could be ugly on one dataset and “good enough” on another. Now, let’s go ahead and plot the execution time of each method as the set of plot summaries gets larger and larger.
plot_summary <- data.table::fread('../data/plot_summaries.txt', header = FALSE, quote = '')
plot_summary <- as.data.frame(plot_summary)
batch_size <- 4000
row_number <- seq(batch_size, nrow(plot_summary), by = batch_size)
# If foreach is better let's change the behavior and use it while we can
runtime_list <- foreach(i = 1:length(row_number)) %dopar% {
  for_dataframe <- system.time(for_on_dataframe(plot_summary[1:row_number[i], ]))
  lapply_method <- system.time(lapply_script(plot_summary[1:row_number[i], ]))
  mapply_method <- system.time(mapply_script(plot_summary[1:row_number[i], ]))
  for_list <- system.time(for_on_list(plot_summary[1:row_number[i], ]))
  for_parallel <- system.time(parallel_loop_script(plot_summary[1:row_number[i], ]))
  # a foreach body yields its last expression; no return() needed (or allowed) here
  list(i, for_dataframe[[3]], lapply_method[[3]], mapply_method[[3]], for_list[[3]], for_parallel[[3]])
}
# change the list of lists to dataframe
runtime_df <- do.call(rbind.data.frame, runtime_list)
names(runtime_df) <- c('index', 'for_dataframe', 'lapply+do.call', 'mapply', 'for_list', 'parallel_loop')
# need to melt the dataframe to be able to line plot them
melted <- reshape2::melt(runtime_df, measure.vars = c('for_dataframe', 'lapply+do.call', 'mapply', 'for_list', 'parallel_loop'))
ggplot(melted, aes(batch_size * index, value, color = variable)) +
  geom_line(aes(group = variable)) +
  xlab('Number of plot summaries') + ylab('Execution time (seconds)') +
  ggtitle('Execution time of loop functions as the text dataset gets large')
As the figure above shows, the “for loop on dataframes” has the ugliest performance at any dataset size. The “for loop on list” and the apply functions perform better, although there is not a big difference between them. The slightly worse performance of lapply might be due to the extra do.call(rbind, ...) step in its implementation. The parallel implementation provides the best performance, as it uses all cores of the machine (in this example, I am running the scripts on an 8-core MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of 2133 MHz DDR3 RAM). We have thoroughly discussed the reasons behind the performance of each set of implementations in Good, bad, ugly feature extraction.